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We describe an interface and an implementation for performing Kronecker 

^y\ product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D 

^ arrays processed in parallel as a batch. This method is suited to cases where 

the Kronecker product component matrices are identical but the operands 

O in a matrix-free application vary in the batch. Any batched GEMM (Gen- 
eral Matrix Multiply) implementation, for example ours [1] or the one in 

^ cuBLAS [2], can also be used for performing batched Kronecker products on 

^ GPUs. However, the specialized implementation presented here is faster and 

^ uses less memory. Partly this is because a simple GEMM based approach 

r^ would require extra copies to and from main memory. We focus on matrix 

^ sizes less than or equal to 16, since these are the typical polynomial degrees 

O in Finite Elements, but the implementation can be easily extended for other 

_\ sizes. We obtain 143 and 285 GFlop/s for single precision real when pro- 

IL" cessing matrices of size 10 and 16, respectively on NVIDIA Tesla K20c using 

• ^ CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 

rS 126 and 268 GFlop/s, respectively. Double precision is easily supported using 

Cd the C-|--|- template mechanism. 
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1. Introduction 

The Kronecker product is an important mathematical concept as well as 
a practical tool in multiple applications [3]. It is useful where either the 
domain or range of a linear operator or the space of unknowns is a "vector 
space" of matrices. It is commonplace for operator evaluation in high-order 
finite elements [4, 5], control theory [6], and when tensor-like decomposition 
is possible for different "directions" [7, 5]. 

Dense Kronecker product action has an additional advantage of possessing 
high arithmetic intensity just like the General Matrix Multiply (GEMM) 
routine in BLAS [8]. This is important on CPUs as well as GPUs [9]. In 
fact, matrix-free action of a Kronecker product can also be cast as a sequence 
of GEMM operations. Thus, modern computational hardware can achieve 
high performance in computing Kronecker product action. This makes it 
important to recognize if a computation can be cast in terms of Kronecker 
products and utilize it just like GEMM. 

Our goal and contribution is to obtain high performance of Kronecker 
product action, possibly even higher than GEMM. We emphasize that we 
never form the explicit product matrix since it is rarely required. In our case, 
the sizes are small (less than or equal to 16) and the matrices are square but 
the method can be easily implemented for rectangular cases and for larger 
sizes. Such sizes correspond to the polynomial degrees and quadrature used 
in high-order finite elements [10, 4, 11, 12, 13]. However, we are unaware of 
any previous work done on GPUs. Part of the reason why this is so can be 
attributed to a lack of very high performance GEMM for batches of small 
matrices, a deficiency that we addressed in [1]. NVIDIA's cuBLAS library 
also contains a batched GEMM implementation [2], which is different than 
ours. 

If high-performance GEMM for small matrices is available, then one might 
deduce that fast Kronecker product action is a solved problem. This reason- 
ing is not entirely correct though. We will show that the performance of even 
a fast GEMM implementation is less than a specialized Kronecker product 
action. This can be due to two separate reasons. In the first case, some com- 
ponents of the Kronecker product do not change in the batch and a general 
GEMM based approach cannot fully take advantage of this. In the second 
case, the domain and range of the action are not (2-D) matrices but 3-D ar- 
rays and more data movement to and from GPU memory reduces the overall 
speed. Both 2-D and 3-D cases arise in practice [4]. 



In order to achieve our goal, we also design a general BLAS-like func- 
tion interface for Kronecker product action. Neither BLAS nor LAPACK 
implement such an interface [8], although it is possible to integrate it easily 
and would fit perfectly in their design philosophy. We design interfaces for 
matrices as well as 3-D arrays. 

We obtain 143 and 285 GFlop/s for single precision real when processing 
matrices of size 10 and 16, respectively on NVIDIA Tesla K20c using CUDA 
5.0. The corresponding speeds for 3-D array Kronecker products are 126 
and 268 GFlop/s, respectively. This performance is appreciably better than 
current batched GEMM performance [1] and justifies the effort required for 
a specialized implementation. 

Here is an outline of the paper. In Section 2 we describe our notation 
for Kronecker products for 2-D as well as 3-D batched inputs. Section 3 
shows a few function interfaces required for the implementation. We give an 
algorithmic overview of our implementation in Section 4. Finally, we present 
the performance of our implementation in Section 5 for various inputs and 
compare it to the performance of batched GEMM. 

2. Kronecker product in 2-D and its generalizations 

The standard way of defining Kronecker product is to define it as an 
output matrix that is computed from two arbitrary real or complex input 
matrices. We work in the complex field below for generality. Let A G C™"^"" 
and B G C™*^"''. The Kronecker product of A and B is denoted by A^ B. 
It is an element of c™"^™6^"''"'i', such that 



A^B 



o-uB ■ ■ ■ ain^B 



aB ■ ■ ■ dmaUaB 



decomposed block-wise. See [14, 15] for it properties and applications. 

Rather than looking at the Kronecker product as a matrix, we, as is also 
common, think of it as a linear operator defined by its two input matrices 
A and B. The domain of the operator A ^ B is C"*"^"" and the range is 
Qmtxma_ fjj-^g operator is defined by its action on a matrix X G C"''^"'". We 
have 

{A ® B)vec{X) = veciBXA^) 



where "vec" is the vectorization operation that stacks columns of a matrix 
to form a vector, and Y G C™''^™" is the output matrix. The superscript T 
stand for transpose without any conjugation. One advantage of this is that 
forming the product matrix, which can be much larger, is not necessary for it 
to be applied. Another advantage is that computing matrix products using 
matrices A and B is faster. This is due to fewer floating point operations 
(by counting flops [3]) and because a BLAS level-3 operation GEMM is used 
in practice. 

2.1. Generalizations to other dimensions 

We now describe generalizations for the Kronecker product action so that 
the input and output need not be restricted to (2-D) matrices. The general- 
ized mapping is from n-D arrays to n-D arrays for n > 1. However, to avoid 
complicating the notation, we work with n < 3 only. One application where 
the 3-D case arises as naturally as the 2-D case is when computing stiffness 
matrix action on tensor product flnite elements [4]. 

First, let us unravel the matrix symbolism present in the 2-D case so that 
the summations are visible. We change our formulation a little so that the 
matrix A is related to the flrst direction in X and the matrix B is related to 
the second direction in X. This will make the generalization natural when we 
proceed to 3-D. We now work with the operator B^Ain the 2-D case, where 
A and B are of the same sizes as mentioned above. But now X G C""^"'' 
and Y = AXB^ G C™"^™''. Expanding this, we get 



ria nf, ria "t, 

I m 



J^ij — / J / J AiiXimBjm — / ^ / ^ AiiBjmXi, 



We generalize this summation to 1-D and 3-D, which means the input "X" 
and output "F" will both be either 1-D or 3-D arrays. Extension to higher 
order products will be obvious. 

In 1-D, we have a single summation as follows. Only matrix A is needed. 



Y, = Y,AaXi 



This case has a more familiar name. It is just a simple matrix-vector multi- 
plication, Y = AX, where Y and X are vectors. 



In 3-D, we have a triple summation as follows. This requires introduction 
of a matrix C G C^^x"-. We have X G C""^"''^"^ and Y G C'"°^'^'^™% 
where 

"a "6 ric 

I m n 

It can be shown that the equation above is an expanded form of the equation 
below. 

vec(r) = (C ® fi ® A)vec(X) 

Here "vec" vectorizes 3-D arrays (X and Y) similar to the 2-D column major 
order vectorization. The order of Kronecker operations in C ^ B ^ A is 
immaterial and thus Kronecker product is associative. 

We hide the vectorization operation notation below by denoting the 1-D 
operation by F = IC{A){X), the 2-D operation by F = IC{A,B){X), and 
the 3-D operation hj Y = IC{A,B,C){X). This emphasizes that we are 
concerned with a linear operation on n-D arrays rather than computing a 
Kronecker product matrix. 

2.2. Batched Kronecker products 

We now describe the batched Kronecker product operation as it needs to 
be implemented in practice. Our choices are inspired by the BLAS/LAPACK 
interface design [8]. 

Let op (standing for operation) refer to a mapping from matrices to ma- 
trices, such that op{A) = A or AJ- or A* depending on an extra variable that 
can contain three values. The superscripts T and * stand for transpose and 
Hermitian-transpose, respectively. Fix real or complex matrices A, 5, and 
C . The matrices can be stored in single precision or double precision. Using 
the BLAS notation, opi^A) is vria x ria, and op{B) is rub x n^, and op{C) is 
rric X Uc. 

Consider the 3-D case first. The inputs are X^ for p = 1,2, ... ,N. Here 
A^ stands for the batch size. Each X is n^ x n^ x Uc. The outputs are Y^, 
each being rUa x mi, x rric, and are computed as follows. 

YP ^a IC{op{A), op{B), op{C)){XP) + (3 Y^. 

Here a and /3 are given scalar values. Note that the matrices A, B, and C 
do not change in the batch. 



The 2-D case is almost the same, except that it is meaningful to have the 
op operation apply on X as well. We then have each op{X) of size n^ x Ub- 
The outputs are Y^, each being ma x nib. We have 

YP ^a IC{op{A), op{B)){op{XP)) + PY". 

The operations op acting on A, B, and C can be different, which leads to 
3^ = 27 combinations in all for the complex case and 2^ = 8 combinations for 
the real case in 3-D. Of course, implementing all these cases will lead to code 
bloat, in which case one might avoid the full generalization and implement 
only the specific cases one is interested in. Another possibility is specific to 
our case. We have A, B, and C as constants and thus the cost of transposing 
them once before using them for all matrices in the batch is relatively minor. 
This avoids all the op related permutations (except those for X in the 2-D 
case) in the implementation. 

3. Batched function interface 

We present BLAS-like function interfaces for 1-D, 2-D, and 3-D batched 
Kronecker products. Each input X and each output Y is stored in column- 
major order, just like the matrices A, B, and C. Most of the naming con- 
vention below would be natural to BLAS and cuBLAS [2] users and is easily 
understood by reading the material presented earlier. Thus, we describe only 
what is new. 

The letter T below stands for type and is for single or double precision 
real or complex. The two variables that need a little explanation are ldx2 
and Idxp (and other similar ones for other matrices). Here ldx2 denotes the 
offset in number of elements between adjacent "planes" in the 3-D array X. 
It is the the number of elements that are stored in the half-open memory 
range corresponding to [Ximi,Xim2), for example. Similarly Idxp denotes 
the offset between adjacent X inputs. This also means that each input and 
output in a batch is stored at uniform offset from previous entity. These 
concepts and arguments for and against them were discussed in detail in [1] . 
Additionally, as discussed there, many code optimizations are possible using 
C-|--|- templates (instead of duplicating code). To avoid losing focus, we do 
not repeat here how C-|--|- templates are used. 

Here is the interface for 1-D. This is almost as if we have implemented a 
batched GEMV operation. Note that having the option for op on X in 1-D 
is just an unnecessary complication. Hence we drop it. 
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void TKRONK 

char transa, 
int ma, int na, 
int batch_count, 
const T* alpha, 
const T* A, int Ida, 
const T* X, int Idp, 
const T* beta, 
T* Y, int Idp) ; 

Here is the interface for 2-D. 

void TKR0N2( 

char transa, char transb, char transx, 

int ma, int na, 

int mb, int nb, 

int batch_count, 

const T* alpha, 

const T* A, int Ida, 

const T* B, int Idb, 

const T* X, int Idx, int Idxp, 

const T* beta, 

T* Y, int Idy, int Idyp) ; 

Finally, here is the interface for 3-D. Note that there is no transx input 
argument. We are not aware of an application that would require it. Besides, 
generalizing transpose for 3-D array cannot just be flipping two indices. 

void TKR0N3( 

char transa, char transb, char transc, 

int ma, int na, 

int mb, int nb, 

int mc, int nc, 

int batch_count, 

const T* alpha, 

const T* A, int Ida, 

const T* B, int Idb, 

const T* C, int Idc, 

const T* X, int Idx, int ldx2, int Idxp, 



const T* beta, 

T* Y, int Idy, int ldy2, int Idyp) ; 

We now give an interface for performing a specialized batched GEMM 
operation where the first matrix A varies across the batch but the second 
matrix B does not. This will be used for implementing TKR0N3 (see Algo- 
rithm 1 ahead). Mathematically, our operation looks like this. Let a and (3 
be given scalars and p = 1,2, . . . ,N. We want to compute 

CP ^a op{AP)op{B) + PCP (1) 

using the following interface. The suffix A specifies that the matrix A varies 
in the batch and thus requires a parameter lda2. 

void TGEMM_A( 

char transa, char transb, 

int m, int n, int k, 

const T* alpha, 

const T* A, int Ida, int lda2, 

const T* B, int Idb, 

const T* beta, 

T* C, int Idc, int ldc2, 

int batch_count, 

int grid_size) ; 

4. CUDA kernel implementation overview 

Before describing a few details relevant to the CUDA based implemen- 
tation, it is useful to see how the 3-D kernel TKR0N3 can be implemented in 
terms of TKR0N2 and the batched TGEMM_A shown earlier. To keep the discus- 
sion simpler, we show this for a non-batched operation, where only one X is 
an input and only one Y is the output. This is shown in Algorithm 1 using 
MATLAB notation. 

We need two CUDA kernel implementations - one for the TKR0N2 oper- 
ation and the other one for TGEMM_A. Both kernel implementations resemble 
the TGEMMjnulti_uniform_kernel implementation we discussed in [1]. Sim- 
ilar to that case, we have two distinct methods. The first one is for matrix 
sizes 1-16. The second can be used for matrices where the square matrix 
dimension can be factorized into a product of two nearly equal numbers. For 



Data: A e C™"''"", B e C'"''''"^ C E C™'=''"'=, X E C-x^bxnc 
Result: Y E C™-^™"^™- where vec(y) = (C ® 5 ® A)vec(X) 

Y ^ zeTos{ma, rriii, rric) 
trap <— zeros(ma, nib, nc) 

% First kron(B,A) is appplied to matrices 'in' X. 

% GEMM operations are used. 

% This step corresponds to TKR0N2. 

for tV = i —)■ nc do 
I tmp{:,:,N)^ A*X{:,:,N)*B' 
end 

% Then C is applied to compute slices of Y. 
% GEMM operations are used again. 
% reshape is an inexpensive MATLAB function. 
% This step corresponds to TGEMM_A. 

for J = i — 7- mb do 

I F(:, J, :) ^ reshape(reshape(tmp(:, J, :), ma, ric) * C", m^, 1, mc) 
end 

Algorithm 1: A sequence of GEMM operations in MATLAB notation 
to compute Kronecker product action on a 3-D array. We implement 
this in CUDA for a batch of inputs in the TKR0N3 interface. 



example, 15 can be factorized as 3 x 5 or 5 x 3. The order is important. 
Experimentally, we saw that the second method is faster than the first one 
for sizes 15 and 16 and we use that to show the results. In both methods, 
each CUDA thread-block is used to process multiple matrices. 

What is more important and different here is that all threads read the 
"constant" matrices in the batch and save them in the shared memory. This is 
an obvious enhancement that helps in getting a better performance compared 
to GEMM. Then, each CUDA block collectively and reads a new X matrix 
and writes the results to a single Y matrix. 

Another issue is that (at least) our implementation of these interfaces re- 
quires temporary storage, or "work arrays" in the Fortran and LAPACK [8] 
parlance. However, we have not explicitly mentioned the details here. Our 
description of the implementation in Algorithm 1 does show how much tem- 
porary storage is needed in our implementation of TKR0N3. We would like to 
stress here that although a work array based interface may look like a hassle 
and a relic when calling the function, it has practical and philosophical ad- 
vantages even when dynamic allocation facilities are present. For example, 
the alternative of memory allocation (and deallocation) inside the routine 
might make it more expensive when large temporary storage is required. We 
noticed the detrimental effect on the speed when we did not use work ar- 
rays and called cudaMemalloc and cudaFree for large temporary storage. 
Dynamic allocation also changes the characteristics of a function. If the 
function could be made "pure", that opportunity is lost. One also has to 
consider multi-threaded environments, error messages in case of memory er- 
rors, multiple devices, and other such responsibilities that are better handled 
at some level higher than a high-performance kernel call [16]. Work arrays 
provide the full generality with only a minor hassle, and one that can be 
easily hidden by writing a small wrapper if one desires. 

5. Performance results 

Our test hardware is the Tesla K20c GPU, which has a peak performance 
of 3.52 TFlop/s and 1.17 GFlop/s in singe and double precision, respectively. 
We work with the EGG (error-correcting code) mode turned off for all cases. 
Our code has been compiled with -arch=sm_35 -02 options using the GUDA 
5.0 toolkit. To stay within the global memory limit of the device for the 
largest matrix size 16, we use 100,000 inputs in the single-precision batch 
and half of that in the double-precision batch. We launch kernels on 5000 
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CUD A kernel thread-blocks. We have not used any transpose or conjugate- 
transpose operation in presenting the results, but the actual implementation 
allows for that. We have computed all the flop rates by using 4m^ and 6m^ 
as the number of flops required to perform Kronecker action in 2-D and 3-D, 
respectively. This holds for real matrices that are of size m. We will show 
results when a = 1 and /3 = 0. Finally, the memory layout of matrices is such 
that all the leading dimension parameters are the smallest they can logically 
be for the given matrix sizes. 

We show the speeds achievable for real single and double precision data 
types in Table 1. Both 2-D and 3-D results are shown there. In general, the 
2-D results are slightly faster than the 3-D result. This is because the 3-D 
result requires two sets of kernel calls and the intermediate result is written 
to and read from the main memory. This is the tmp array described in 
Algorithm 1. 

We also compare the Kronecker product action speeds with the GEMM 
speeds we showed in [1] and the batched GEMM available in cuBLAS. The 
results are in Figure 1 and Figure 2 for single and double precision, respec- 
tively. The motivation is to show that the Kronecker product action can 
be appreciably faster than both our GEMM and the one in cuBLAS. Note 
that if GEMM were used to implement Kronecker product, then the effective 
speed would be even lower than that of GEMM because of redundant data 
movement. The figures show that the best possible speeds (without such a 
reduction) are much lower than our specialized implementation. 

6. Discussion 

We have described a BLAS-like interface for a batched Kronecker product 
routine generalized to 1-D, 2-D, and 3-D arrays. The implementation and 
results are for 2-D and 3-D only. Our implementation has assumed that 
the component matrices that form the Kronecker product are identical in 
the batch. This is obviously a design choice and based on our application 
requirement. If desired, one can similarly implement the computationally 
more expensive and general case where one or more of A, B, or C matrix 
changes in the batch. 

We have focused on NVIDIA GPUs and used CUD A throughout. How- 
ever, the implementation does not use any special features that might not 
be available on other many-core hardware via OpenCL, for example. A new 
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Size 


Single-2 


Single-3 


Double-2 


Double-3 


1 


1 


1 


1 


1 


2 


7 


6 


6 


5 


3 


21 


17 


19 


15 


4 


47 


38 


38 


32 


5 


80 


63 


63 


51 


6 


86 


67 


78 


61 


7 


122 


100 


109 


90 


8 


177 


146 


156 


135 


9 


138 


122 


89 


88 


10 


143 


126 


108 


103 


11 


171 


151 


134 


128 


12 


169 


153 


145 


137 


13 


166 


149 


151 


137 


14 


162 


150 


153 


141 


15 


202 


187 


160 


154 


16 


285 


268 


152 


169 



Table 1: The GFlop/s rates when computing action of independent Kronecker products of 
various sizes and precisions using the TKR0N2 and TKRDN3 interfaces on an NVIDIA Tesla 
K20c. We use 100,000 inputs for single precision and 50,000 inputs for double precision 
(to stay within global memory limit of the device for the largest matrix size 16). The 
comparison with corresponding numbers for a GEMM baseline are made in Figure 1 and 
Figure 2. 
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-A-Single-GEMM 
^Single-cuBLAS 
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Figure 1: Performance obtained for single precision TKR0N2 and TKR0N3 when computing 
on 100,000 batched inputs with a = 1 and /3 = on an NVIDIA Tesla K20c. We show 
figures for our GEMM from [1] for comparison and the batched GEMM implemented in 
cuBLAS [2]. 




-^Double-KRON3 
*Double-KRON2 
-A-Double-GEMM 
^Double-cuBLAS 
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Figure 2: Performance obtained for double precision TKR0N2 and TKR0N3 when computing 
on 50,000 batched inputs with a = 1 and /3 = on an NVIDIA Tesla K20c. We show 
figures for our GEMM from [1] for comparison and the batched GEMM implemented in 
cuBLAS [2]. 
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implementation will allow devices from other vendors with little or no change 
to the interface. 
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